Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add nr_repr_docs parameter #2130

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Conversation

robechun
Copy link

What does this PR do?

Adds a parameter that you can pass around for getting the nr_repr_docs. This helps when attempting to get a higher # of representative docs for a given topic.

Before submitting

  • This PR fixes a typo or improves the docs (if yes, ignore all other checks!).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a Github issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes (if applicable)?
  • Did you write any new necessary tests?

@MaartenGr
Copy link
Owner

Thank you for this pull request! I believe there were a number of issues open discussing this, which issue does this PR refer to? I remember there was some discussion on how to approach this there. That perspective will help me understand what was needed and what decisions we made before going through this PR.

@robechun
Copy link
Author

I didn't actually look at past issues—it was more from my own usage that I found this added parameter to be helpful. That said, it looks like this issue is the most pertinent

@MaartenGr
Copy link
Owner

Ah, then you might have missed the contributing guideline! I think it's first important to establish what exactly is needed.

From the issue you shared as well as the connected one, it seems that users also want to re-calculate the number of representative documents after training the model.

Also, what made you choose to add this parameter to .transform rather than the init? This seems like a parameter that should go there.

@robechun
Copy link
Author

Whoops, missed that part on creating an issue! 😅

WRT what is exactly needed—AFAICT, I think there are three needs/wants based on the specific usage pattern.

  1. Where users simply use fit_transform to get the topic clusters (in my use-case)
  2. Where users want to get representative docs for online topic modeling
  3. Where users want to specify nr_repr_docs when they call merge_topics

This PR would solve for 1 and 3 (given the nr_repr_docs can now be added as an input parameter to either fit_transform and merge_topics) but not necessarily 2.

As for why I included it in transform vs init—my mental model was the following:

  • init is used for instantiating the different "lego" pieces for the BERTopic orchestration (such as clustering algorithm, representation model, etc).
  • fit_transform or merge_models is where the user can provide specifics about what they want from the model—for example, the model itself remains the same, but I could run the same model with different documents to produce different outputs. Similarly, I could run the same model with the same documents and different nr_repr_docs to extract a slightly different output (namely the set of representative docs).

Let me know if you think I'm off in any way

@MaartenGr
Copy link
Owner

This PR would solve for 1 and 3 (given the nr_repr_docs can now be added as an input parameter to either fit_transform and merge_topics) but not necessarily 2.

I think it would indeed be nice to give users the ability to choose the number of representative documents. Wouldn't it be easier to update the number of representative documents after fitting the model? That way, it would solve all 3 problems without the need to actually needing to refit the model constantly. The representative documents should not affect the representations (except for LLM-based representation models which has separate parameters for this) so it does not feel like it should be a parameter related to either init or transform. It might be easiest and more related to the task itself to put it after training the model.

As for why I included it in transform vs init—my mental model was the following:

Generally, we would put these kinds of parameter in the init since it is part of the parameter space of the model and is not specific to the behavior of transform. However, as mentioned above, it might actually be better to go with a new function that allows you to update the number of representative documents after having created the model.

@kpetrianakis
Copy link

Hello,

is this going to be merged to master? It is a useful feature that we anticipate.

Thank you

@MaartenGr
Copy link
Owner

@kpetrianakis Not yet in its current state as is shown in my response above. I believe a new function might be the most API friendly as it allows you to easily update the number of representative after training. So there is no new training pipeline needed, just extraction.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants